feat: support sample-weighted aggregations for sampled trace data#1963
feat: support sample-weighted aggregations for sampled trace data#1963vinzee wants to merge 3 commits intohyperdxio:mainfrom
Conversation
TraceSourceSchema has a new optional field `sampleRateExpression`. When undefined, SQL aggregations are unchanged. But when set (e.g. `SpanAttributes['SampleRate']`), SQL aggregations are rewritten to correct for upstream 1-in-N sampling: aggFn | Before | After (sample-corrected) | Overhead -------------- | ---------------------- | --------------------------------------------------- | -------- count | count() | sum(weight) | ~1x count + cond | countIf(cond) | sumIf(weight, cond) | ~1x avg | avg(col) | sum(col * weight) / sum(weight) | ~2x sum | sum(col) | sum(col * weight) | ~1x quantile(p) | quantile(p)(col) | quantileTDigestWeighted(p)(col, toUInt32(weight)) | ~1.5x min/max | unchanged | unchanged | 1x count_distinct | unchanged | unchanged (cannot correct) | 1x Weight is wrapped as greatest(toUInt64OrZero(toString(expr)), 1) so spans without a SampleRate attribute default to weight 1 (unsampled data produces identical results to the original queries). Types: - Add sampleWeightExpression to ChartConfig schema - Add sampleRateExpression to TraceSourceSchema + Mongoose model Query builder: - Rewrite aggFnExpr in renderChartConfig.ts when sampleWeightExpression is set, with safe default-to-1 wrapping Integration (propagate sampleWeightExpression to all chart configs): - ChartEditor/utils.ts, DBSearchPage, ServicesDashboardPage, sessions - DBDashboardPage (raw SQL + builder branches) - AlertPreviewChart - SessionSubpanel - ServiceDashboardEndpointPerformanceChart - ServiceDashboardSlowestEventsTile (p95 query + events table) - ServiceDashboardEndpointSidePanel (error rate + throughput) - ServiceDashboardDbQuerySidePanel (total query time + throughput) - External API v2 charts, AI controller, alerts (index + template) UI: - Add Sample Rate Expression field to trace source admin form
|
@vinzee is attempting to deploy a commit to the HyperDX Team on Vercel. A member of the Team first needs to authorize it. |
🦋 Changeset detectedLatest commit: 23872e7 The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
PR Review
|
add SampleRate migration for existing deployments, standardize sampleWeightExpression propagation, and document percentile approximation The initial sampling support (c61acd1) only added the SampleRate materialized column to the seed file, which runs on fresh installs. Existing deployments need an ALTER TABLE migration. Changes: - Add CH migration 000002 to create the SampleRate materialized column on existing otel_traces tables (IF NOT EXISTS, safe to re-run) - Standardize sampleWeightExpression propagation across 5 components (SessionSubpanel, ServiceDashboardDbQuerySidePanel, ServiceDashboardEndpointSidePanel, ServiceDashboardEndpointPerformanceChart, ServiceDashboardSlowestEventsTile) from direct assignment to conditional spread, matching the pattern used elsewhere - Note in SourceForm help text that percentiles under sampling use quantileTDigestWeighted (approximate T-Digest sketch)
Problem
High-throughput services can produce millions of spans per second. Storing every span is expensive, so we run the OpenTelemetry Collector's tail-sampling processor to keep only 1-in-N spans. Each kept span carries a
SampleRateattribute recording N.Once data is sampled, naive aggregations are wrong: count() returns N-x fewer events than actually occurred, sum()/avg() are biased, and percentiles shift. Dashboards show misleadingly low request counts, throughput, and error rates, making capacity planning and alerting unreliable.
Summary
TraceSourceSchema gets a new optional field
sampleRateExpression- the ClickHouse expression that evaluates to the per-span sample rate (e.g.SpanAttributes['SampleRate']). When not configured, all queries are unchanged.When set, the query builder rewrites SQL aggregations to weight each span by its sample rate:
Types:
Query builder:
spans without a SampleRate attribute default to weight 1 (unsampled
data produces identical results to the original queries).
is set, with safe default-to-1 wrapping
Integration (propagate sampleWeightExpression to all chart configs):
UI:
Screenshots or video
How to test locally or on Vercel
References